# Multimodal Pretraining

| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Vit So400m Patch14 Siglip Gap 896.pali Pt | Apache-2.0 | Vision model based on the SigLIP image encoder, employing global average pooling; part of the PaliGemma project. | Text-to-Image, Transformers | timm | 15 | 1 |
| Vit So400m Patch14 Siglip Gap 896.pali2 3b Pt | Apache-2.0 | Vision model based on the SigLIP image encoder, employing global average pooling; part of the PaliGemma2 project. | Text-to-Image, Transformers | timm | 14 | 1 |
| Vit So400m Patch14 Siglip Gap 448.pali Mix | Apache-2.0 | Vision-language model based on the SigLIP image encoder, utilizing global average pooling; suitable for multimodal tasks. | Text-to-Image, Transformers | timm | 15 | 0 |
| Vit Base Patch16 Siglip Gap 224.webli | Apache-2.0 | Vision Transformer based on SigLIP, containing only the image encoder and using a global average pooling strategy. | Image Classification, Transformers | timm | 178 | 1 |
| Vit Large Patch14 Clip 224.datacompxl | Apache-2.0 | Vision Transformer based on the CLIP architecture, designed for image feature extraction and released by the LAION organization. | Image Classification, Transformers | timm | 14 | 0 |
| Convnext Base.clip Laion2b Augreg | Apache-2.0 | ConvNeXt-Base image encoder from the CLIP framework, trained on the LAION-2B dataset; supports image feature extraction. | Image Classification, Transformers | timm | 522 | 0 |
| Convnext Base.clip Laion2b | Apache-2.0 | CLIP image encoder based on the ConvNeXt architecture, trained by LAION; suitable for multimodal vision-language tasks. | Image Classification, Transformers | timm | 297 | 0 |
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Minivla Wrist Vq Libero90 Prismatic | MIT | MiniVLA is a vision-language-action model focused on robotics, supporting multimodal image-text-to-text tasks. | Image-to-Text, Transformers, English | Stanford-ILIAD | 18 | 0 |
| Minivla History2 Vq Libero90 Prismatic | MIT | MiniVLA is a compact yet high-performance vision-language-action model, compatible with the Prismatic VLMs training scripts and suitable for robotics and multimodal tasks. | Image-to-Text, Transformers, English | Stanford-ILIAD | 22 | 1 |
| Minivla Vq Libero90 Prismatic | MIT | MiniVLA is a lightweight vision-language-action model compatible with the Prismatic VLMs training framework, supporting multimodal image-text-to-text tasks. | Image-to-Text, Transformers, English | Stanford-ILIAD | 31 | 0 |
| Cogact Base | MIT | CogACT is a novel vision-language-action (VLA) architecture that combines vision-language models with specialized action modules for robotic manipulation tasks. | Multimodal Fusion, Transformers, English | CogACT | 6,589 | 12 |
| Vit Base Patch16 Plus Clip 240.laion400m E31 | MIT | Dual-use vision-language model trained on the LAION-400M dataset, supporting zero-shot image classification tasks. | Image Classification | timm | 37.23k | 0 |
| Merlin | MIT | Merlin is a 3D vision-language model for computed tomography, pretrained on both structured electronic health records and unstructured radiology reports. | Text-to-Image, English | stanfordmimi | 2,418 | 6 |
| Openvla 7b Prismatic | MIT | OpenVLA 7B is an open-source vision-language-action model in the Prismatic VLMs training-script format, supporting full fine-tuning of its 7.5 billion parameters. | Image-to-Text, Transformers, English | openvla | 156 | 5 |
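
The "Openvla 7b Prismatic" entry is the checkpoint in Prismatic VLMs training-script format and is loaded through that codebase; for quick inference the OpenVLA project also publishes an HF-format sibling, `openvla/openvla-7b`, that runs through `transformers` with remote code. The sketch below follows that published usage; the prompt template and `unnorm_key` are assumptions taken from the project README and should be verified against the model card.

```python
# Sketch of querying OpenVLA for a robot action via the HF-format checkpoint
# openvla/openvla-7b (the Prismatic-format entry above uses the Prismatic codebase).
# The prompt template and unnorm_key follow the project's published example and
# should be treated as assumptions to verify.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("third_person_cam.png").convert("RGB")  # hypothetical observation
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # continuous end-effector action vector
```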
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| LVM Ckpts | Apache-2.0 | LVM is a visual pretraining model that achieves large-scale visual learning by converting visual data into visual sentences and predicting them autoregressively. | Text-to-Image, Transformers | Emma02 | 247 | 5 |
| Blip2 Opt 6.7b | MIT | BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text generation and visual question answering. | Image-to-Text, Transformers, English | merve | 26 | 2 |
| Blip2 Test | MIT | BLIP-2 built on OPT-2.7b; it performs image-to-text generation by freezing the image encoder and the large language model while training a querying transformer (Q-Former). | Image-to-Text, Transformers, English | advaitadasein | 18 | 0 |
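
The BLIP-2 entries above pair a frozen image encoder and a frozen OPT language model through a trained Q-Former; in `transformers` this is exposed via `Blip2Processor` and `Blip2ForConditionalGeneration`. A minimal captioning/VQA sketch, using the canonical `Salesforce/blip2-opt-2.7b` checkpoint as a stand-in since the exact repo ids of the listed copies are not shown here:

```python
# Minimal BLIP-2 image-to-text sketch (captioning and visual question answering).
# Assumption: Salesforce/blip2-opt-2.7b is used as a stand-in for the listed entries.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg").convert("RGB")  # hypothetical input image

# Captioning: image only, no text prompt.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())

# Visual question answering: condition generation on a question prompt.
inputs = processor(
    images=image, text="Question: what is in the picture? Answer:", return_tensors="pt"
).to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```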
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Matcha Base | Apache-2.0 | MatCha is a vision-language model focused on chart understanding and mathematical reasoning, strengthened by jointly modeling charts and language data. | Text-to-Image, Transformers, Multilingual | google | 2,445 | 26 |
| Matcha Plotqa V1 | Apache-2.0 | MatCha fine-tuned on the PlotQA-v1 dataset, specializing in visual question answering over charts, with strong chart derendering and numerical reasoning performance. | Text-to-Image, Transformers, Multilingual | google | 83 | 3 |
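
MatCha is built on the Pix2Struct image-to-text architecture, so chart question answering is plain conditional generation from a rendered chart plus a question. A minimal sketch, assuming the Hugging Face id `google/matcha-plotqa-v1` corresponds to the "Matcha Plotqa V1" entry above:

```python
# Minimal chart question-answering sketch with MatCha (Pix2Struct backbone).
# Assumption: google/matcha-plotqa-v1 is the repo id behind the "Matcha Plotqa V1" entry.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/matcha-plotqa-v1")
model = Pix2StructForConditionalGeneration.from_pretrained("google/matcha-plotqa-v1")

chart = Image.open("chart.png").convert("RGB")  # hypothetical chart image
question = "Which year has the highest value?"

# The question is passed as the text header rendered alongside the chart.
inputs = processor(images=chart, text=question, return_tensors="pt")
predictions = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(predictions[0], skip_special_tokens=True))
```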
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Markuplm Base Finetuned Websrc | | MarkupLM is a multimodal pretrained model for visually rich document understanding and information extraction, combining text with markup-language (HTML/XPath) information. | Multimodal Fusion, Transformers, English | microsoft | 168 | 10 |
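
MarkupLM consumes raw HTML, combining node text with its markup (XPath) structure, so WebSRC-style question answering becomes extractive span prediction over the page. A sketch roughly following the `transformers` documentation, assuming the repo id `microsoft/markuplm-base-finetuned-websrc` matches the entry above:

```python
# Extractive question answering over HTML with MarkupLM.
# Assumption: microsoft/markuplm-base-finetuned-websrc is the repo id for the entry above.
import torch
from transformers import MarkupLMForQuestionAnswering, MarkupLMProcessor

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base-finetuned-websrc")
model = MarkupLMForQuestionAnswering.from_pretrained("microsoft/markuplm-base-finetuned-websrc")

html = "<html><body><h1>Acme Laptop 14</h1><p>Price: $999</p></body></html>"  # toy page
question = "What is the price?"

# The processor parses the HTML into tokens plus XPath features.
encoding = processor(html, questions=question, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)

# Decode the highest-scoring answer span.
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer_ids = encoding["input_ids"][0, start : end + 1]
print(processor.decode(answer_ids).strip())
```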
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Taiyi Roberta 124M D V2 | Apache-2.0 | A specially pretrained English multimodal text encoder based on the RoBERTa-base architecture, trained with 1 million image-text pairs. | Multimodal Fusion, Transformers, English | IDEA-CCNL | 18 | 0 |
| Taiyi Vit 87M D | Apache-2.0 | An English MAP visual encoder based on the ViT-base architecture, specially pretrained on the COCO and Visual Genome datasets. | Image-to-Text, Transformers | IDEA-CCNL | 24 | 0 |
| Vilt B32 Mlm | Apache-2.0 | ViLT is a vision-and-language Transformer pretrained on the GCC+SBU+COCO+VG datasets, focusing on joint understanding of images and text. | Text-to-Image, Transformers | dandelin | 7,761 | 11 |
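
ViLT embeds image patches and word tokens into a single Transformer with no convolutional or region-proposal features, so the MLM checkpoint can fill in a masked word conditioned on the image. A minimal sketch, assuming the repo id `dandelin/vilt-b32-mlm` matches the "Vilt B32 Mlm" entry above:

```python
# Image-conditioned masked language modeling with ViLT.
# Assumption: dandelin/vilt-b32-mlm is the repo id for the "Vilt B32 Mlm" entry above.
import torch
from PIL import Image
from transformers import ViltForMaskedLM, ViltProcessor

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltForMaskedLM.from_pretrained("dandelin/vilt-b32-mlm")

image = Image.open("cats.jpg").convert("RGB")  # hypothetical input image
text = "a photo of two [MASK] lying on a couch"

encoding = processor(image, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)

# Report the top prediction for the masked position(s).
mask_positions = (encoding.input_ids == processor.tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_ids = outputs.logits[0, mask_positions].argmax(-1)
print(processor.tokenizer.decode(predicted_ids))
```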
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Mengzi Oscar Base Retrieval | Apache-2.0 | A Chinese image-text retrieval model based on the Chinese multimodal pretraining model Mengzi-Oscar, fine-tuned on the COCO-ir dataset. | Text-to-Image, Transformers, Chinese | Langboat | 17 | 3 |